Experimental GGUF-2-PTE Converter #13266
Conversation
…workflow is shown. Functional but prone to lots of random whacky torch export errors
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13266. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 2ced5c5 with merge base c8a0706. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
torch_dtype = torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```
@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?
If so, I'm not sure this is really a converter in the sense that it doesn't preserve the quantization from GGUF.
But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.
We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):
```python
# pip install gguf numpy
import numpy as np
import gguf

# ---- helpers ----
def _fp16le_to_f32(buf_mv):
    """Decode a single little-endian fp16 value to float32."""
    return np.frombuffer(buf_mv, dtype="<f2", count=1).astype(np.float32)[0]

def _unpack_q4k_scale_min_codes(bytes12: memoryview):
    """Return two arrays (8,) of 6-bit integers for sub-block scales and mins."""
    b = np.frombuffer(bytes12, dtype=np.uint8)
    # Layout per llama.cpp ("Tensor Encoding Schemes" / get_scale_min_k4):
    #   0: EEAAAAAA  1: FFBBBBBB  2: GGCCCCCC  3: HHDDDDDD
    #   4: eeaaaaaa  5: ffbbbbbb  6: ggcccccc  7: hhdddddd
    #   8: eeeeEEEE  9: ffffFFFF 10: ggggGGGG 11: hhhhHHHH
    # A-D / a-d are the full 6-bit scales / mins for sub-blocks 0-3.
    # E-H / e-h (sub-blocks 4-7) are split: low 4 bits in bytes 8-11, and the
    # top 2 bits stored in the high bits of bytes 0-3 (scales) / 4-7 (mins).
    S0_3 = b[0:4] & 0x3F
    M0_3 = b[4:8] & 0x3F
    S4_7 = (b[8:12] & 0x0F) | ((b[0:4] >> 6) << 4)
    M4_7 = (b[8:12] >> 4) | ((b[4:8] >> 6) << 4)
    S = np.concatenate([S0_3, S4_7]).astype(np.float32)  # (8,)
    M = np.concatenate([M0_3, M4_7]).astype(np.float32)  # (8,)
    return S, M

def extract_q4k(gguf_path: str, tensor_name: str):
    """
    Returns:
        q_codes : (n_super, 256) uint8   -- 4-bit codes per superblock (values 0..15)
        scales  : (n_super, 8)   float32 -- per-sub-block scale (real units)
        mins    : (n_super, 8)   float32 -- per-sub-block min/offset (real units)
        d, dmin : (n_super,)     float32 -- super-scales used to decode the 6-bit fields
    Notes:
        - Each superblock covers 256 weights = 8 sub-blocks * 32 each.
        - Reconstruct weights for sub-block j: w = scales[i, j] * q - mins[i, j]
        - Zero-point (affine form): z = mins / scales (can be fractional)
    """
    r = gguf.GGUFReader(gguf_path)
    t = next(t for t in r.tensors if t.name == tensor_name)
    raw = memoryview(np.ascontiguousarray(t.data).reshape(-1).view(np.uint8))
    # Superblock layout (Q4_K):
    # [d fp16][dmin fp16][12B packed S/M codes][128B 4-bit codes]
    stride = 2 + 2 + 12 + 128  # 144 bytes
    assert len(raw) % stride == 0, "Unexpected Q4_K tensor byte length"
    n_super = len(raw) // stride

    d = np.empty(n_super, dtype=np.float32)
    dmin = np.empty(n_super, dtype=np.float32)
    S_all = np.empty((n_super, 8), dtype=np.float32)
    M_all = np.empty((n_super, 8), dtype=np.float32)
    Q_all = np.empty((n_super, 256), dtype=np.uint8)

    off = 0
    for i in range(n_super):
        # two fp16 super-scales
        d[i] = _fp16le_to_f32(raw[off:off + 2]); off += 2
        dmin[i] = _fp16le_to_f32(raw[off:off + 2]); off += 2
        # packed 6-bit sub-scales / sub-mins
        s12 = raw[off:off + 12]; off += 12
        S6, M6 = _unpack_q4k_scale_min_codes(s12)
        # realize to real units
        S_all[i, :] = d[i] * S6
        M_all[i, :] = dmin[i] * M6
        # 128 bytes => 256 4-bit codes. llama.cpp's dequantize_row_q4_K consumes
        # these in 32-byte chunks: the low nibbles of chunk c give sub-block 2c,
        # the high nibbles give sub-block 2c+1.
        codes_b = np.frombuffer(raw[off:off + 128], dtype=np.uint8).reshape(4, 32); off += 128
        q = np.empty((4, 64), dtype=np.uint8)
        q[:, :32] = codes_b & 0x0F
        q[:, 32:] = codes_b >> 4
        Q_all[i, :] = q.reshape(256)

    return Q_all, S_all, M_all, d, dmin

# ---- Example usage ----
# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# # Dequantize one superblock 'i', sub-block j (32 weights):
# i, j = 0, 3
# w_block = s[i, j] * q[i, j*32:(j+1)*32].astype(np.float32) - m[i, j]
# # Optional affine form zero-point:
# z_block = m[i, j] / s[i, j]
```
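As a small follow-on sketch (not part of the original snippet), a full-tensor dequantization in plain numpy that reuses `extract_q4k` above and assumes its output layout; it returns a flat float32 array in GGUF storage order, and the reshape back to the model's logical weight shape is deliberately left out:

```python
import numpy as np

def dequantize_q4k(gguf_path: str, tensor_name: str) -> np.ndarray:
    """Dequantize an entire Q4_K tensor to float32 (flat, superblock-major order)."""
    q, s, m, _d, _dmin = extract_q4k(gguf_path, tensor_name)
    n_super = q.shape[0]
    # View codes as (superblock, sub-block, weight) and apply per-sub-block params.
    codes = q.reshape(n_super, 8, 32).astype(np.float32)
    w = s[:, :, None] * codes - m[:, :, None]  # w = scale * code - min
    return w.reshape(n_super * 256)
```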
Now we don't currently have any quantized kernels that will handle floating point zeros (in XNNPACK or elsewhere), but I could quickly put up a patch to support that for our lowbit kernels in a day or two.
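To make that concrete, here is a hedged, illustrative-only recast of the extracted data into the int_data/scales/zeros form mentioned above (reusing the `q`, `s`, `m` names from the example usage); the fractional zero-points are exactly what existing kernels don't handle:

```python
import numpy as np

# q: (n_super, 256) uint8 4-bit codes; s, m: (n_super, 8) float32 from extract_q4k.
int_data = q                                  # unsigned 4-bit codes, 0..15
scales = np.repeat(s, 32, axis=1)             # per-weight scale, (n_super, 256)
zero_points = np.repeat(m / s, 32, axis=1)    # fractional zero-points, (n_super, 256)

# Affine reconstruction matches the scale/min form above:
#   scales * (int_data - zero_points) == s * q - m
w = scales * (int_data.astype(np.float32) - zero_points)
```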
Thanks for the example, the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.
I was imagining we could export a PTE file without weights, and plug in gguf weights at runtime, but that also requires some more work on export/runtime before it's possible.
Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (this is also mentioned in the docs).
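For anyone reproducing this, a quick check along these lines (with the repo id and GGUF file name as placeholders, standing in for the values used by the PR's loader) is enough to confirm the load-time dequantization:

```python
from transformers import AutoModelForCausalLM

model_id = "<hf-repo-with-gguf>"  # placeholder
filename = "<model>.gguf"         # placeholder

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
print({p.dtype for p in model.parameters()})  # expected: {torch.float32}
```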
As you've mentioned, it would be great to have some sort of a conversion module we route the model through once the GGUF has been loaded by HF.
What would be the best path forward for development? Do we want an RFC/some abstractions in this PR we can use to capture this process + any additional steps (e.g. dtype conversion)?

cc @swolchok on gguf-pte conversion
Are the models in the […]? Also CC @mergennachin
@swolchok Good point! Welp, I haven't done a detailed analysis on this, but I think it's largely dependent on the model architecture/operations within it - feel free to have a play around with changing the `model_id`. Perhaps there are some commonalities between what exports well and what doesn't. If we investigate this, it could help us understand the limitations of what is/isn't exportable.
Summary
This PR is not intended for merge. Instead, it is designed to demonstrate a potential method under which `.gguf` files can be converted to `.pte` by leveraging some of the existing `transformers` ecosystem.

The key idea is the following: load the `.gguf` file through the `transformers` library into a suitable auto class, then run the resulting model through the usual export path to produce a `.pte` (a rough sketch of this flow is shown below).
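A minimal sketch of that flow, assuming the standard `torch.export` / `executorch.exir.to_edge` / `to_executorch` path; the repo id, file name, and example inputs are placeholders rather than the exact code in this PR:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from executorch.exir import to_edge

model_id = "<hf-repo-with-gguf>"  # placeholder
filename = "<model>.gguf"         # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename).eval()

# torch.export traces the model with example inputs; a short prompt suffices here.
example_inputs = (tokenizer("Hello", return_tensors="pt").input_ids,)
exported = torch.export.export(model, example_inputs)

# Lower to an ExecuTorch program and write out the .pte file.
executorch_program = to_edge(exported).to_executorch()
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```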
Early Learnings & Limitations

Attached code generates a `.pte` model that is yet to be tested

The experiment in this PR converts `SmolLM2-135M-Instruct-Q8_0.gguf` into a `.pte` file. However, whether this model works as expected within the ExecuTorch runtime is still to be confirmed. This may also require some conversion of the `tokenizer` (I'm not too sure, but I'd be interested to know; it's probably in the docs somewhere). A rough runtime check is sketched below.
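A hedged smoke-test sketch, assuming the ExecuTorch Python runtime bindings (`executorch.runtime.Runtime`) are available in your build; both the API surface used here and the single-tensor token-id input are assumptions, and the token ids are placeholders:

```python
import torch
from executorch.runtime import Runtime

# Assumed API: load the .pte produced above and run its "forward" method.
runtime = Runtime.get()
program = runtime.load_program("model.pte")
method = program.load_method("forward")

input_ids = torch.tensor([[1, 2, 3]], dtype=torch.long)  # placeholder token ids
outputs = method.execute([input_ids])
print(type(outputs), len(outputs))
```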
Torch export errors can be... scary

Conceptually, I thought that running this experiment would not be too difficult. However, some of the setup issues around getting a reliable torch export for certain models proved to be quite challenging. This could just be due to my lack of knowledge of `torch.export`, but I think it also offers some insight from the perspective of a developer who wants to focus on having a smooth experience converting to `.pte` files. I also tried LFM-2 (simply changing the `model_id` and `filename`), and it's crazy how scary some of the errors became: complex traces rooted in various operators and potentially unsupported ops appeared in the logs. Of course, LFM-2 is quite a new model, so perhaps I gave it an unfair example, but nonetheless it could be interesting to see where the limit is for models that can comfortably be converted via this workflow.

cc @lucylq